2.
iScience ; 27(5): 109570, 2024 May 17.
Article in English | MEDLINE | ID: mdl-38646172

ABSTRACT

The three-dimensional organization of genomes plays a crucial role in essential biological processes. The segregation of chromatin into A and B compartments highlights regions of activity and inactivity, providing a window into the genomic activities specific to each cell type. Yet, the steep costs associated with acquiring Hi-C data, necessary for studying this compartmentalization across various cell types, pose a significant barrier to studying cell-type-specific genome organization. To address this, we present a prediction tool called compartment prediction using recurrent neural networks (CoRNN), which predicts compartmentalization of the 3D genome using histone modification enrichment. CoRNN demonstrates robust cross-cell-type prediction of A/B compartments with an average AuROC of 90.9%. Cell-type-specific predictions align well with known functional elements, with H3K27ac and H3K36me3 identified as highly predictive histone marks. We further investigated our mispredictions and found that they are located in regions with ambiguous compartmental status. Furthermore, our model's generalizability is validated by predicting compartments in independent tissue samples, which underscores its broad applicability.

3.
bioRxiv ; 2024 Apr 14.
Article in English | MEDLINE | ID: mdl-38645064

ABSTRACT

Over the past 15 years, a variety of next-generation sequencing assays have been developed for measuring the 3D conformation of DNA in the nucleus. Each of these assays gives, for a particular cell or tissue type, a distinct picture of 3D chromatin architecture. Accordingly, making sense of the relationship between genome structure and function requires teasing apart two closely related questions: how does chromatin 3D structure change from one cell type to the next, and how do different measurements of that structure differ from one another, even when the two assays are carried out in the same cell type? In this work, we assemble a collection of chromatin 3D datasets, each represented as a 2D contact map, spanning multiple assay types and cell types. We then build a machine learning model that predicts missing contact maps in this collection. We use the model to systematically explore how genome 3D architecture changes, at the level of compartments, domains, and loops, between cell types and between assay types.

4.
J Proteome Res ; 2024 Apr 30.
Article in English | MEDLINE | ID: mdl-38687997

ABSTRACT

Traditional database search methods for the analysis of bottom-up proteomics tandem mass spectrometry (MS/MS) data are limited in their ability to detect peptides with post-translational modifications (PTMs). Recently, "open modification" database search strategies, in which the requirement that the mass of the database peptide closely matches the observed precursor mass is relaxed, have become popular as ways to find a wider variety of types of PTMs. Indeed, in one study, Kong et al. reported that the open modification search tool MSFragger can achieve higher statistical power to detect peptides than a traditional "narrow window" database search. We investigated this claim empirically and, in the process, uncovered a potential general problem with false discovery rate (FDR) control in the machine learning postprocessors Percolator and PeptideProphet. This problem might have contributed to Kong et al.'s report that their empirical results suggest that FDR control in the narrow window setting might generally be compromised. Indeed, reanalyzing the same data while using a more standard form of target-decoy competition-based FDR control, we found that, after accounting for chimeric spectra as well as for the inherent difference in the number of candidates in open and narrow searches, the data do not provide sufficient evidence that FDR control in proteomics MS/MS database search is inherently problematic.

5.
bioRxiv ; 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38496477

ABSTRACT

The emergence of single-cell time-series datasets enables modeling of changes in various types of cellular profiles over time. However, due to the disruptive nature of single-cell measurements, it is impossible to capture the full temporal trajectory of a particular cell. Furthermore, single-cell profiles can be collected at mismatched time points across different conditions (e.g., sex, batch, disease) and data modalities (e.g., scRNA-seq, scATAC-seq), which makes modeling challenging. Here we propose a joint modeling framework, Sunbear, for integrating multi-condition and multi-modal single-cell profiles across time. Sunbear can be used to impute single-cell temporal profile changes, align multi-dataset and multi-modal profiles across time, and extrapolate single-cell profiles in a missing modality. We applied Sunbear to reveal sex-biased transcription during mouse embryonic development and predict dynamic relationships between epigenetic priming and transcription for cells in which multi-modal profiles are unavailable. Sunbear thus enables the projection of single-cell time-series snapshots to multi-modal and multi-condition views of cellular trajectories.

6.
Proteomics ; 24(8): e2300084, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38380501

ABSTRACT

Assigning statistical confidence estimates to discoveries produced by a tandem mass spectrometry proteomics experiment is critical to enabling principled interpretation of the results and assessing the cost/benefit ratio of experimental follow-up. The most common technique for computing such estimates is to use target-decoy competition (TDC), in which observed spectra are searched against a database of real (target) peptides and a database of shuffled or reversed (decoy) peptides. TDC procedures for estimating the false discovery rate (FDR) at a given score threshold have been developed for application at the level of spectra, peptides, or proteins. Although these techniques are relatively straightforward to implement, it is common in the literature to skip over the implementation details or even to make mistakes in how the TDC procedures are applied in practice. Here we present Crema, an open-source Python tool that implements several TDC methods of spectrum-, peptide- and protein-level FDR estimation. Crema is compatible with a variety of existing database search tools and provides a straightforward way to obtain robust FDR estimates.


Subject(s)
Algorithms, Peptides, Protein Databases, Peptides/chemistry, Proteins/analysis, Proteomics/methods
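
The target-decoy competition idea described in the abstract above can be sketched in a few lines. This is a minimal illustration and not Crema's code or API: the function names `compete` and `estimate_fdr` are hypothetical, the input is assumed to be one best target score and one best decoy score per spectrum, and the estimate uses the common (decoys + 1) / targets form, which is only one of several TDC variants.

```python
def compete(target_scores, decoy_scores):
    """Keep, for each spectrum, whichever of its best target and best decoy
    match scores higher. Returns a list of (score, is_decoy) pairs.

    Assumption: ties are assigned to the decoy so the sketch stays
    deterministic; real tools typically break ties at random.
    """
    winners = []
    for target, decoy in zip(target_scores, decoy_scores):
        if target > decoy:
            winners.append((target, False))
        else:
            winners.append((decoy, True))
    return winners


def estimate_fdr(winners, threshold):
    """Estimate the FDR among target wins scoring at or above `threshold`.

    Decoy wins above the threshold stand in for the unknown number of
    incorrect target wins. The +1 is a conservative correction, and the
    result is capped at 1.0.
    """
    targets = sum(1 for score, is_decoy in winners
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in winners
                 if score >= threshold and is_decoy)
    # max(targets, 1) avoids dividing by zero when nothing is accepted.
    return min(1.0, (decoys + 1) / max(targets, 1))


# Toy data: five spectra, each with a best target and a best decoy score.
winners = compete([9.0, 8.0, 7.0, 3.0, 2.0], [1.0, 2.0, 1.5, 4.0, 2.5])
print(estimate_fdr(winners, threshold=5.0))
print(estimate_fdr(winners, threshold=2.0))
```

In the toy data, three targets and no decoys pass a threshold of 5.0, so the estimate is (0 + 1) / 3. Lowering the threshold to 2.0 admits two decoy wins and pushes the estimate to the cap of 1.0.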
7.
Nature ; 626(8001): 1084-1093, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38355799

ABSTRACT

The house mouse (Mus musculus) is an exceptional model system, combining genetic tractability with close evolutionary affinity to humans [1,2]. Mouse gestation lasts only 3 weeks, during which the genome orchestrates the astonishing transformation of a single-cell zygote into a free-living pup composed of more than 500 million cells. Here, to establish a global framework for exploring mammalian development, we applied optimized single-cell combinatorial indexing [3] to profile the transcriptional states of 12.4 million nuclei from 83 embryos, precisely staged at 2- to 6-hour intervals spanning late gastrulation (embryonic day 8) to birth (postnatal day 0). From these data, we annotate hundreds of cell types and explore the ontogenesis of the posterior embryo during somitogenesis and of kidney, mesenchyme, retina and early neurons. We leverage the temporal resolution and sampling depth of these whole-embryo snapshots, together with published data [4-8] from earlier timepoints, to construct a rooted tree of cell-type relationships that spans the entirety of prenatal development, from zygote to birth. Throughout this tree, we systematically nominate genes encoding transcription factors and other proteins as candidate drivers of the in vivo differentiation of hundreds of cell types. Remarkably, the most marked temporal shifts in cell states are observed within one hour of birth and presumably underlie the massive physiological adaptations that must accompany the successful transition of a mammalian fetus to life outside the womb.


Subject(s)
Newborn Animals, Mammalian Embryo, Embryonic Development, Gastrula, Single-Cell Analysis, Time-Lapse Imaging, Animals, Female, Mice, Pregnancy, Newborn Animals/embryology, Newborn Animals/genetics, Cell Differentiation/genetics, Mammalian Embryo/cytology, Mammalian Embryo/embryology, Embryonic Development/genetics, Gastrula/cytology, Gastrula/embryology, Gastrulation/genetics, Kidney/cytology, Kidney/embryology, Mesoderm/cytology, Mesoderm/enzymology, Neurons/cytology, Neurons/metabolism, Retina/cytology, Retina/embryology, Somites/cytology, Somites/embryology, Time Factors, Transcription Factors/genetics, Genetic Transcription, Organ Specificity/genetics
8.
Lab Invest ; 104(1): 100282, 2024 01.
Article in English | MEDLINE | ID: mdl-37924947

ABSTRACT

Large-scale high-dimensional multiomics studies are essential to unravel molecular complexity in health and disease. We developed an integrated system for tissue sampling (CryoGrid), analyte preparation (PIXUL), and downstream multiomic analysis in a 96-well plate format (Matrix), MultiomicsTracks96, which we used to interrogate matched frozen and formalin-fixed paraffin-embedded (FFPE) mouse organs. Using this system, we generated 8-dimensional omics data sets encompassing 4 molecular layers of intracellular organization: epigenome (H3K27Ac, H3K4me3, RNA polymerase II, and 5mC levels), transcriptome (messenger RNA levels), epitranscriptome (m6A levels), and proteome (protein levels) in brain, heart, kidney, and liver. There was a high correlation between data from matched frozen and FFPE organs. The Segway genome segmentation algorithm applied to epigenomic profiles confirmed known organ-specific superenhancers in both FFPE and frozen samples. Linear regression analysis showed that proteomic profiles, known to be poorly correlated with transcriptomic data, can be more accurately predicted by the full suite of multiomics data, compared with using epigenomic, transcriptomic, or epitranscriptomic measurements individually.


Subject(s)
Formaldehyde, Proteomics, Mice, Animals, Fixatives, Tissue Fixation/methods, Proteomics/methods, Paraffin Embedding/methods
9.
bioRxiv ; 2023 Sep 22.
Article in English | MEDLINE | ID: mdl-37790381

ABSTRACT

Most studies of genome organization have focused on intra-chromosomal (cis) contacts because they harbor key features such as DNA loops and topologically associating domains. Inter-chromosomal (trans) contacts have received much less attention, and tools for interrogating potential biologically relevant trans structures are lacking. Here, we develop a computational framework to identify sets of loci that jointly interact in trans from Hi-C data. This method, trans-C, initiates probabilistic random walks with restarts from a set of seed loci to traverse an input Hi-C contact network, thereby identifying sets of trans-contacting loci. We validate trans-C in three increasingly complex models of established trans contacts: the Plasmodium falciparum var genes, the mouse olfactory receptor "Greek islands", and the human RBM20 cardiac splicing factory. We then apply trans-C to systematically test the hypothesis that genes co-regulated by the same trans-acting element (i.e., a transcription or splicing factor) co-localize in three dimensions to form "RNA factories" that maximize the efficiency and accuracy of RNA biogenesis. We find that many loci with multiple binding sites of the same transcription factor interact with one another in trans, especially those bound by transcription factors with intrinsically disordered domains. Similarly, clustered binding of a subset of RNA binding proteins correlates with trans interaction of the encoding loci. These findings support the existence of trans interacting chromatin domains (TIDs) driven by RNA biogenesis. Trans-C provides an efficient computational framework for studying these and other types of trans interactions, empowering studies of a poorly understood aspect of genome architecture.
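
A random walk with restart, the general technique named in the abstract above, can be sketched as follows. This is a generic illustration and not the trans-C implementation: the function name, the dense list-of-lists contact matrix, the restart probability of 0.5 and the handling of loci without contacts are all assumptions made for the example.

```python
def random_walk_with_restart(contacts, seeds, restart=0.5,
                             tol=1e-10, max_iter=1000):
    """Return the stationary visiting probability of every locus.

    contacts: square list of lists of non-negative contact weights.
    seeds:    indices of the seed loci the walker restarts from.
    restart:  probability of jumping back to a seed at each step.
    """
    n = len(contacts)
    seed_set = set(seeds)
    if not seed_set:
        raise ValueError("at least one seed locus is required")
    start = [1.0 / len(seed_set) if i in seed_set else 0.0 for i in range(n)]
    row_sums = [sum(row) for row in contacts]

    p = start[:]
    for _ in range(max_iter):
        # Restart mass goes straight back to the seeds.
        nxt = [restart * s for s in start]
        for i in range(n):
            if p[i] == 0.0:
                continue
            walk_mass = (1.0 - restart) * p[i]
            if row_sums[i] == 0:
                # Assumption: a locus with no contacts sends the walker
                # back to the seeds so that probabilities still sum to 1.
                for j in range(n):
                    nxt[j] += walk_mass * start[j]
                continue
            for j in range(n):
                if contacts[i][j]:
                    nxt[j] += walk_mass * contacts[i][j] / row_sums[i]
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return p


# Toy network: four loci in a chain (0-1-2-3), with locus 0 as the seed.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(chain, seeds=[0])
print(scores)
```

Loci closer to the seed in the network receive higher scores, so ranking loci by score and taking the top of the list is one simple way to pick a set of loci that jointly interact with the seeds.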

10.
bioRxiv ; 2023 Oct 20.
Article in English | MEDLINE | ID: mdl-37905060

ABSTRACT

Cross-species comparison and prediction of gene expression profiles are important to understand regulatory changes during evolution and to transfer knowledge learned from model organisms to humans. Single-cell RNA-seq (scRNA-seq) profiles enable us to capture gene expression profiles with respect to variations among individual cells; however, cross-species comparison of scRNA-seq profiles is challenging because of data sparsity, batch effects, and the lack of one-to-one cell matching across species. Moreover, single-cell profiles are challenging to obtain in certain biological contexts, limiting the scope of hypothesis generation. Here we developed Icebear, a neural network framework that decomposes single-cell measurements into factors representing cell identity, species, and batch factors. Icebear enables accurate prediction of single-cell gene expression profiles across species, thereby providing high-resolution cell type and disease profiles in under-characterized contexts. Icebear also facilitates direct cross-species comparison of single-cell expression profiles for conserved genes that are located on the X chromosome in eutherian mammals but on autosomes in chicken. This comparison, for the first time, revealed evolutionary and diverse adaptations of X-chromosome upregulation in mammals.

11.
Nat Commun ; 14(1): 5086, 2023 08 22.
Article in English | MEDLINE | ID: mdl-37607941

ABSTRACT

The complex life cycle of Plasmodium falciparum requires coordinated gene expression regulation to allow host cell invasion, transmission, and immune evasion. Increasing evidence now suggests a major role for epigenetic mechanisms in gene expression in the parasite. In eukaryotes, many lncRNAs have been identified as pivotal regulators of genome structure and gene expression. To investigate the regulatory roles of lncRNAs in P. falciparum, we explore the intergenic lncRNA distribution in nuclear and cytoplasmic subcellular locations. Using nascent RNA expression profiles, we identify a total of 1768 lncRNAs, of which 718 (~41%) are novel in P. falciparum. The subcellular localization and stage-specific expression of several putative lncRNAs are validated using RNA-FISH. Additionally, the genome-wide occupancy of several candidate nuclear lncRNAs is explored using ChIRP. The results reveal that lncRNA occupancy sites are focal and sequence-specific with a particular enrichment for several parasite-specific gene families, including those involved in pathogenesis and sexual differentiation. Genomic and phenotypic analysis of one specific lncRNA demonstrates its importance in sexual differentiation and reproduction. Our findings bring a new level of insight into the role of lncRNAs in pathogenicity, gene regulation and sexual differentiation, opening new avenues for targeted therapeutic strategies against the deadly malaria parasite.


Subject(s)
Falciparum Malaria, Malaria, Parasites, Long Noncoding RNA, Humans, Animals, Plasmodium falciparum/genetics, Long Noncoding RNA/genetics, Falciparum Malaria/genetics
12.
Bioinformatics ; 39(7), 2023 07 01.
Article in English | MEDLINE | ID: mdl-37421399

ABSTRACT

MOTIVATION: Modality matching in single-cell omics data analysis (i.e., matching cells across datasets collected using different types of genomic assays) has become an important problem, because unifying perspectives across different technologies holds the promise of yielding biological and clinical discoveries. However, single-cell dataset sizes can now reach hundreds of thousands to millions of cells, which remain out of reach for most multimodal computational methods. RESULTS: We propose LSMMD-MA, a large-scale Python implementation of the MMD-MA method for multimodal data integration. In LSMMD-MA, we reformulate the MMD-MA optimization problem using linear algebra and solve it with KeOps, a CUDA framework for symbolic matrix computation in Python. We show that LSMMD-MA scales to a million cells in each modality, two orders of magnitude greater than existing implementations. AVAILABILITY AND IMPLEMENTATION: LSMMD-MA is freely available at https://github.com/google-research/large_scale_mmdma and archived at https://doi.org/10.5281/zenodo.8076311.


Subject(s)
Genome, Genomics, Genomics/methods, Research Design, Data Analysis, Single-Cell Analysis, Software
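
The "MMD" in MMD-MA stands for maximum mean discrepancy, a kernel-based distance between two sets of points. The sketch below computes only that quantity, in plain Python, to show what is being minimized when two modalities are aligned. It is not the LSMMD-MA code and does not use KeOps: the function names, the Gaussian kernel, the bandwidth of 1.0 and the biased estimator are assumptions, and the naive double loop scales quadratically, which is the cost the abstract says the large-scale implementation is designed to tame.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) similarity between two points given as tuples."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared maximum mean discrepancy.

    xs, ys: lists of points (e.g. embedded cells from two modalities).
    The value is zero when the two sets coincide and grows as the two
    point clouds drift apart.
    """
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy embeddings: one reference cloud, a slightly shifted copy, a distant copy.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
near = [(x + 0.1, y + 0.1) for x, y in reference]
far = [(x + 5.0, y + 5.0) for x, y in reference]

print(mmd_squared(reference, near))
print(mmd_squared(reference, far))
```

A well-aligned pair of embeddings should give a value close to zero, as the slightly shifted copy does here, while the distant copy gives a much larger value.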
13.
Nat Commun ; 14(1): 3303, 2023 06 06.
Article in English | MEDLINE | ID: mdl-37280210

ABSTRACT

Nuclear compartments are prominent features of 3D chromatin organization, but sequencing depth limitations have impeded investigation at ultra fine-scale. CTCF loops are generally studied at a finer scale, but the impact of looping on proximal interactions remains enigmatic. Here, we critically examine nuclear compartments and CTCF loop-proximal interactions using a combination of in situ Hi-C at unparalleled depth, algorithm development, and biophysical modeling. Producing a large Hi-C map with 33 billion contacts in conjunction with an algorithm for performing principal component analysis on sparse, super massive matrices (POSSUMM), we resolve compartments to 500 bp. Our results demonstrate that essentially all active promoters and distal enhancers localize in the A compartment, even when flanking sequences do not. Furthermore, we find that the TSS and TTS of paused genes are often segregated into separate compartments. We then identify diffuse interactions that radiate from CTCF loop anchors, which correlate with strong enhancer-promoter interactions and proximal transcription. We also find that these diffuse interactions depend on CTCF's RNA binding domains. In this work, we demonstrate features of fine-scale chromatin organization consistent with a revised model in which compartments are more precise than commonly thought while CTCF loops are more protracted.


Subject(s)
Chromatin, Genetic Enhancer Elements, Chromatin/genetics, CCCTC-Binding Factor/genetics, CCCTC-Binding Factor/metabolism, Genetic Enhancer Elements/genetics, Cell Nucleus/genetics, Cell Nucleus/metabolism, Genetic Promoter Regions
14.
PLoS Comput Biol ; 19(5): e1011049, 2023 05.
Article in English | MEDLINE | ID: mdl-37146053

ABSTRACT

Single cell ATAC-seq (scATAC-seq) enables the mapping of regulatory elements in fine-grained cell types. Despite this advance, analysis of the resulting data is challenging, and large scale scATAC-seq data are difficult to obtain and expensive to generate. This motivates a method to leverage information from previously generated large scale scATAC-seq or scRNA-seq data to guide our analysis of new scATAC-seq datasets. We analyze scATAC-seq data using latent Dirichlet allocation (LDA), a Bayesian algorithm that was developed to model text corpora, summarizing documents as mixtures of topics defined based on the words that distinguish the documents. When applied to scATAC-seq, LDA treats cells as documents and their accessible sites as words, identifying "topics" based on the cell type-specific accessible sites in those cells. Previous work used uniform symmetric priors in LDA, but we hypothesized that nonuniform matrix priors generated from LDA models trained on existing data sets may enable improved detection of cell types in new data sets, especially if they have relatively few cells. In this work, we test this hypothesis in scATAC-seq data from whole C. elegans nematodes and SHARE-seq data from mouse skin cells. We show that nonsymmetric matrix priors for LDA improve our ability to capture cell type information from small scATAC-seq datasets.


Subject(s)
Algorithms, Caenorhabditis elegans, Animals, Mice, Caenorhabditis elegans/genetics, Bayes Theorem, Chromatin, Nucleic Acid Regulatory Sequences, Single-Cell Analysis/methods
15.
bioRxiv ; 2023 Dec 11.
Article in English | MEDLINE | ID: mdl-37131601

ABSTRACT

Long-read DNA sequencing has recently emerged as a powerful tool for studying both genetic and epigenetic architectures at single-molecule and single-nucleotide resolution. Long-read epigenetic studies encompass both the direct identification of native cytosine methylation as well as the identification of exogenously placed DNA N6-methyladenine (DNA-m6A). However, detecting DNA-m6A modifications using single-molecule sequencing, as well as co-processing single-molecule genetic and epigenetic architectures, is limited by computational demands and a lack of supporting tools. Here, we introduce fibertools, a state-of-the-art toolkit that features a semi-supervised convolutional neural network for fast and accurate identification of m6A-marked bases using PacBio single-molecule long-read sequencing, as well as the co-processing of long-read genetic and epigenetic data produced using either PacBio or Oxford Nanopore sequencing platforms. We demonstrate accurate DNA-m6A identification (>90% precision and recall) along >20 kilobase long DNA molecules with a ~1,000-fold improvement in speed. In addition, we demonstrate that fibertools can readily integrate genetic and epigenetic data at single-molecule resolution, including the seamless conversion between molecular and reference coordinate systems, allowing for accurate genetic and epigenetic analyses of long-read data within structurally and somatically variable genomic regions.

16.
bioRxiv ; 2023 Apr 05.
Article in English | MEDLINE | ID: mdl-37066300

ABSTRACT

The house mouse, Mus musculus, is an exceptional model system, combining genetic tractability with close homology to human biology. Gestation in mouse development lasts just under three weeks, a period during which its genome orchestrates the astonishing transformation of a single cell zygote into a free-living pup composed of >500 million cells. Towards a global framework for exploring mammalian development, we applied single cell combinatorial indexing (sci-*) to profile the transcriptional states of 12.4 million nuclei from 83 precisely staged embryos spanning late gastrulation (embryonic day 8 or E8) to birth (postnatal day 0 or P0), with 2-hr temporal resolution during somitogenesis, 6-hr resolution through to birth, and 20-min resolution during the immediate postpartum period. From these data (E8 to P0), we annotate dozens of trajectories and hundreds of cell types and perform deeper analyses of the unfolding of the posterior embryo during somitogenesis as well as the ontogenesis of the kidney, mesenchyme, retina, and early neurons. Finally, we leverage the depth and temporal resolution of these whole embryo snapshots, together with other published data, to construct and curate a rooted tree of cell type relationships that spans mouse development from zygote to pup. Throughout this tree, we systematically nominate sets of transcription factors (TFs) and other genes as candidate drivers of the in vivo differentiation of hundreds of mammalian cell types. Remarkably, the most dramatic shifts in transcriptional state are observed in a restricted set of cell types in the hours immediately following birth, and presumably underlie the massive changes in physiology that must accompany the successful transition of a placental mammal to extrauterine life.

17.
bioRxiv ; 2023 Mar 20.
Article in English | MEDLINE | ID: mdl-36993219

ABSTRACT

Background: The multiome is an integrated assembly of distinct classes of molecules and molecular properties, or "omes," measured in the same biospecimen. Freezing and formalin-fixed paraffin-embedding (FFPE) are two common ways to store tissues, and these practices have generated vast biospecimen repositories. However, these biospecimens have been underutilized for multi-omic analysis due to the low throughput of current analytical technologies, which impedes large-scale studies. Methods: Tissue sampling, preparation, and downstream analysis were integrated into a 96-well format multi-omics workflow, MultiomicsTracks96. Frozen mouse organs were sampled using the CryoGrid system, and matched FFPE samples were processed using a microtome. The 96-well format sonicator, PIXUL, was adapted to extract DNA, RNA, chromatin, and protein from tissues. The 96-well format analytical platform, Matrix, was used for chromatin immunoprecipitation (ChIP), methylated DNA immunoprecipitation (MeDIP), methylated RNA immunoprecipitation (MeRIP), and RNA reverse transcription (RT) assays followed by qPCR and sequencing. LC-MS/MS was used for protein analysis. The Segway genome segmentation algorithm was used to identify functional genomic regions, and linear regressors based on the multi-omics data were trained to predict protein expression. Results: MultiomicsTracks96 was used to generate 8-dimensional datasets including RNA-seq measurements of mRNA expression; MeRIP-seq measurements of m6A and m5C; ChIP-seq measurements of H3K27Ac, H3K4me3, and Pol II; MeDIP-seq measurements of 5mC; and LC-MS/MS measurements of proteins. We observed high correlation between data from matched frozen and FFPE organs. The Segway genome segmentation algorithm applied to epigenomic profiles (ChIP-seq: H3K27Ac, H3K4me3, Pol II; MeDIP-seq: 5mC) was able to recapitulate and predict organ-specific super-enhancers in both FFPE and frozen samples. Linear regression analysis showed that proteomic expression profiles can be more accurately predicted by the full suite of multi-omics data, compared to using epigenomic, transcriptomic, or epitranscriptomic measurements individually. Conclusions: The MultiomicsTracks96 workflow is well suited for high dimensional multi-omics studies, for instance multiorgan animal models of disease, drug toxicities, environmental exposure, and aging, as well as large-scale clinical investigations involving the use of biospecimens from existing tissue repositories.

18.
bioRxiv ; 2023 Mar 06.
Article in English | MEDLINE | ID: mdl-36945371

ABSTRACT

The human genome contains millions of candidate cis-regulatory elements (CREs) with cell-type-specific activities that shape both health and myriad disease states. However, we lack a functional understanding of the sequence features that control the activity and cell-type-specific features of these CREs. Here, we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of over 680,000 sequences, representing a nearly comprehensive set of all annotated CREs among three cell types (HepG2, K562, and WTC11), finding 41.7% to be functional. By testing sequences in both orientations, we find promoters to have significant strand orientation effects. We also observe that their 200 nucleotide cores function as non-cell-type-specific 'on switches' providing similar expression levels to their associated gene. In contrast, enhancers have weaker orientation effects, but increased tissue-specific characteristics. Utilizing our lentiMPRA data, we develop sequence-based models to predict CRE function with high accuracy and delineate regulatory motifs. Testing an additional lentiMPRA library encompassing 60,000 CREs in all three cell types, we further identified factors that determine cell-type specificity. Collectively, our work provides an exhaustive catalog of functional CREs in three widely used cell lines, and showcases how large-scale functional measurements can be used to dissect regulatory grammar.

19.
J Proteome Res ; 22(2): 577-584, 2023 02 03.
Article in English | MEDLINE | ID: mdl-36633229

ABSTRACT

The first step in the analysis of protein tandem mass spectrometry data typically involves searching the observed spectra against a protein database. During database search, the search engine must digest the proteins in the database into peptides, subject to digestion rules that are under user control. The choice of these digestion parameters, as well as selection of post-translational modifications (PTMs), can dramatically affect the size of the search space and hence the statistical power of the search. The Tide search engine separates the creation of the peptide index from the database search step, thereby saving time by allowing a peptide index to be reused in multiple searches. Here we describe an improved implementation of the indexing component of Tide that consumes around four times fewer resources (CPU and RAM) than the previous version and can generate arbitrarily large peptide databases, limited only by the amount of available disk space. We use this improved implementation to explore the relationship between database size and the parameters controlling digestion and PTMs, as well as database size and statistical power. Our results can help guide practitioners in proper selection of these important parameters.


Subject(s)
Algorithms, Peptides, Peptides/chemistry, Proteins/metabolism, Search Engine, Protein Databases, Software
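
To make concrete how digestion parameters change the size of the search space, the following sketch performs an in-silico digest of a single protein. It is an illustration only and not Tide's indexing code: the function name and parameter names are hypothetical, and the cleavage rule (cut after K or R unless the next residue is P) is the conventional trypsin rule, assumed here because the abstract does not name an enzyme.

```python
import re


def tryptic_peptides(protein, missed_cleavages=0, min_len=1, max_len=50):
    """Return the set of peptides produced by an in-silico tryptic digest.

    Cleaves after K or R unless followed by P (assumed trypsin rule),
    allows up to `missed_cleavages` skipped sites per peptide, and keeps
    peptides whose length lies in [min_len, max_len].
    """
    # Positions just after each cleavable residue, excluding the C-terminus.
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", protein)
             if m.end() < len(protein)]
    bounds = [0] + sites + [len(protein)]

    peptides = set()
    for i in range(len(bounds) - 1):
        # Joining j - i consecutive fragments skips j - i - 1 cleavage sites.
        last = min(i + 2 + missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptide = protein[bounds[i]:bounds[j]]
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


# Toy sequence: the R at position 3 is followed by P, so it is not cleaved.
sequence = "AKRPGKCR"
for allowed in (0, 1, 2):
    peptides = tryptic_peptides(sequence, missed_cleavages=allowed)
    print(allowed, len(peptides), sorted(peptides))
```

Even on this eight-residue toy sequence, allowing one or two missed cleavages raises the peptide count from three to five to six. Across a whole proteome, and combined with variable PTMs, the same effect is what drives the growth in database size that the abstract discusses.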
20.
Biometrics ; 79(4): 3472-3484, 2023 12.
Article in English | MEDLINE | ID: mdl-36652258

ABSTRACT

Recently, Barber and Candès laid the theoretical foundation for a general framework for false discovery rate (FDR) control based on the notion of "knockoffs." A closely related FDR control methodology has long been employed in the analysis of mass spectrometry data, referred to there as "target-decoy competition" (TDC). However, any approach that aims to control the FDR, which is defined as the expected value of the false discovery proportion (FDP), suffers from a problem. Specifically, even when successfully controlling the FDR at level α, the FDP in the list of discoveries can significantly exceed α. We offer FDP-SD, a new procedure that rigorously controls the FDP in the knockoff/TDC competition setup by guaranteeing that the FDP is bounded by α at a desired confidence level. Compared with the recently published framework of Katsevich and Ramdas, FDP-SD generally delivers more power and often substantially so in simulated and real data.


Subject(s)
Algorithms, Mass Spectrometry, False Positive Reactions